Abstract


Graph kernels are historically the most widely-used technique 
for graph classification tasks. However, these methods suffer 
from limited performance because of the hand-crafted com- 
binatorial features of graphs. In recent years, graph neural 
networks (GNNs) have become the state-of-the-art method 
in downstream graph-related tasks due to their superior
performance. Most GNNs are based on the Message Passing
Neural Network (MPNN) framework. However, recent studies
show that MPNNs cannot exceed the power of the
Weisfeiler-Lehman (WL) algorithm in the graph isomorphism
test. To address the limitations of existing graph kernel
and GNN methods, in this paper, we propose a novel GNN
framework,
termed Kernel Graph Neural Networks (KerGNNs), which 
integrates graph kernels into the message passing process 
of GNNs. Inspired by convolution filters in convolutional 
neural networks (CNNs), KerGNNs adopt trainable hidden 
graphs as graph filters which are combined with subgraphs 
to update node embeddings using graph kernels. In addi- 
tion, we show that MPNNs can be viewed as special cases 
of KerGNNs. We apply KerGNNs to multiple graph-related 
tasks and use cross-validation to make fair comparisons with 
benchmarks. We show that our method achieves competitive 
performance compared with existing state-of-the-art meth- 
ods, demonstrating the potential to increase the representation 
ability of GNNs. We also show that the trained graph filters in 
KerGNNs can reveal the local graph structures of the dataset, 
which significantly improves the model interpretability
compared with conventional GNN models.


In recent years, the machine learning research community
has devoted substantial energy to applying graph neural
networks (GNNs) to numerous downstream graph-related tasks
(Ying et al. 2018; Kipf and Welling 2016; Chen, Li, and
Bruna 2017). For graph-structured tasks, what the different
GNN variants have in common is the Message Passing Neural
Network (MPNN) framework (Gilmer et al. 2017). MPNNs
consist of two stages: neighborhood aggregation and
graph-level readout. Specifically, during neighborhood
aggregation, each node generates its embedding in three
steps: (1) receiving messages from its neighbors, (2)
aggregating the messages, and (3) updating its own features
to encode the local structural information. For graph-level
tasks, a permutation-invariant readout function is used to
extract feature representations from the entire graph. In
fact, MPNNs have been motivated and derived as a continuous
and differentiable analog of the Weisfeiler-Lehman (WL)
algorithm (Leman and Weisfeiler 1968), which is known to
successfully test graph isomorphism for a broad class of
graphs. However, recent studies (Xu et al. 2018a; Morris et
al. 2019) show that MPNNs are at most as powerful as the WL
kernel (Shervashidze et al. 2011) and the WL algorithm in
graph isomorphism tests. This demonstrates theoretical
limits on the expressivity of popular GNNs. For example,
Figure 1(a) shows two graphs that cannot be distinguished
by the 1-WL algorithm and are therefore also
indistinguishable by MPNNs.

Figure 1: (a) The 1-WL graph isomorphism test cannot
distinguish one hexagon from two triangles because their
nodes have the same neighborhood multisets, while a
subgraph-based method can tell them apart based on the
different subgraph topologies. (b) The yellow shadow
represents the subgraph of node v. After interacting with
graph filters, the updated node is colored in blue.
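The 1-WL limitation on the hexagon and the two triangles can be checked with a few lines of code. The following is a minimal sketch of 1-WL color refinement on unlabeled graphs; the function name `wl_color_histogram` and the adjacency-list encoding are illustrative assumptions, not part of the proposed model.

```python
from collections import Counter

def wl_color_histogram(adj, rounds=3):
    """Multiset of 1-WL colors after `rounds` refinement steps (unlabeled graph)."""
    colors = {v: 0 for v in adj}  # every node starts with the same color
    for _ in range(rounds):
        # New color = (own color, sorted multiset of neighbor colors).
        colors = {
            v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            for v in adj
        }
    return Counter(colors.values())

# One hexagon versus two disjoint triangles, as adjacency lists.
hexagon = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1],
                 3: [4, 5], 4: [3, 5], 5: [3, 4]}

# Both graphs are 2-regular, so every node keeps receiving the same color.
same = wl_color_histogram(hexagon) == wl_color_histogram(two_triangles)
```

Because the color histograms coincide after every round, any model whose node update only sees neighborhood multisets assigns the two graphs the same representation.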


Before the advent of GNNs, graph kernels were the most
widely used techniques for solving graph classification
tasks (Kriege, Johansson, and Morris 2020). Graph kernels
measure the similarity between graphs and can be plugged
into a kernel machine (e.g., a support vector machine).
Kernel functions remove the need to learn node embeddings
in high dimensions, and enable us to operate in a
high-dimensional feature space by simply computing the
kernel value in the low-dimensional feature space, which is
more computationally efficient than computing in the
high-dimensional space directly. Because of the empirical
success of kernel-based methods and the increasing
availability of graph-structured datasets, numerous graph
kernel methods have been proposed, including walk and path
kernels (Gärtner, Flach, and Wrobel 2003; Kashima, Tsuda,
and Inokuchi 2003; Borgwardt and Kriegel 2005), subgraph
kernels (Shervashidze et al. 2009), and WL kernels
(Shervashidze et al. 2011). However, graph kernels still
have limitations due to their hand-crafted features and
fixed feature construction scheme, which may not
effectively capture high-dimensional information (e.g.,
complex node interactions) on large graphs.

In this paper, to address the above-mentioned issues and
increase the expressivity of GNNs, we propose a
subgraph-based node aggregation algorithm that combines
GNNs and graph kernels into one framework, so that the
advantages of both methods can be leveraged. On the one
hand, for neighborhood aggregation, we apply graph kernels
which use the subgraph induced by node neighbors, so that
the expressivity is not limited by the 1-WL isomorphism
test, which uses the multiset of neighboring nodes. An
example is shown in Figure 1(a), where node 1 has the same
neighborhood multiset in both graphs but induces different
subgraph topologies with its neighbors, which can be
distinguished by graph kernels. On the other hand, we make
the feature construction scheme of graph kernels trainable
following the standard GNN training framework, possibly
allowing for greater adaptability.
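To make the subgraph argument concrete, the sketch below extracts the subgraph induced by a node and its 1-hop neighbors and counts its edges for the hexagon and the two triangles. The helper `ego_subgraph_edges` and the adjacency-list encoding are hypothetical names introduced only for this illustration.

```python
def ego_subgraph_edges(adj, v):
    """Edges of the subgraph induced by node v and its 1-hop neighbors."""
    nodes = {v} | set(adj[v])
    # Keep an edge only if both endpoints lie inside the neighborhood.
    return {frozenset((a, b)) for a in nodes for b in adj[a] if b in nodes}

hexagon = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1],
                 3: [4, 5], 4: [3, 5], 5: [3, 4]}

hex_edges = ego_subgraph_edges(hexagon, 0)        # a 3-node path
tri_edges = ego_subgraph_edges(two_triangles, 0)  # a triangle
```

The node has two neighbors in both graphs, yet its induced subgraph is a path in the hexagon and a closed triangle in the other graph, which is exactly the structural difference a graph kernel on subgraphs can pick up.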

Based on the subgraph-based node aggregation, we pro- 
pose a novel GNN framework, termed KerGNNs. Specif- 
ically, we first introduce a set of trainable hidden graphs, 
named graph filters, in each layer. Each node within the in- 
put graph is associated with a subgraph capturing its local 
topological information. We then adopt graph kernel func- 
tions to compare the similarity of graph filters and input 
subgraphs, and use the computed kernel values to update the 
respective node's feature representations (as shown in
Figure 1(b)). We show that KerGNNs provide a new kernel
perspective to extend the standard CNN structure into the graph
domain and generalize most MPNNs. The proposed model is 
then evaluated with various real-world graph and node clas- 
sification tasks, and the results show superior performance 
of KerGNNs compared with many existing state-of-the-art 
models. To better understand the predictions of GNN-based 
methods, KerGNNs can further visualize the trained graph 
filters, similar to visualizing filters in CNNs, and thus pro- 
vide better human-interpretable explanations for a variety of 
graph-related tasks, compared to existing GNNs. Our main 
contributions are summarized as follows: 


1. We use neighborhood subgraph topology combined with 
kernel methods for GNN neighborhood aggregation, and 
show with proof that the expressivity of this approach is 
not limited by the 1-WL algorithm. 


2. We provide a new perspective to generalize CNNs into 
the graph domain, by showing that both 2-D convolution 
and graph neighborhood aggregation can be interpreted 
using the language of kernel methods. 


3. Besides visualizing the output graphs of the model,
KerGNNs can further reveal the local structure of the
input graphs by visualizing the topology of trained
graph filters, which significantly improves the model
interpretability and transparency compared with standard
MPNNs.


Related Work 
Expressivity 


Several works have been devoted to improving the ex- 
pressivity of GNNs by introducing spatial, hierarchical, 


and higher-order GNN variants. For example, 
proposed the mix-hop structure which can learn 
a more general class of neighborhood mixing relationships. 
(2019) proposed to use Con- 
sistent Port Numbering GNN to augment the neighborhood 
aggregation, but port orderings are not unique and differ- 


ent orderings may lead to different expressivity. 
Groß, and Günnemann| (2020) leveraged the atom coordi- 


nate information in the molecular graph to improve the ex- 
pressivity, but the notion of direction is hard to generalize 
to more general graphs. used 
the graph homomorphism numbers as updated embeddings 
and show the expressivity of such graph classifiers with 
universality property, which unfortunately lacks neural
network structure. Higher-order GNN variants have been
studied in Morris et al. (2019) and Maron et al. (2019),
which are more powerful than the 1-WL graph isomorphism
test. However, higher-order methods typically involve heavy
computation, and KerGNNs introduce a different way to break
this 1-WL limit.


Combination of Graph Kernel and GNNs 


Graph kernels and GNNs can be combined in the same
framework. Some works apply graph kernels and neural
networks at different stages (Navarin, Tran, and Sperduti
2018; Nikolentzos et al. 2018). There are also works on
using GNN architecture to design new kernels. For example,
Du et al. (2019) proposed a graph kernel equivalent to
infinitely wide GNNs which can be trained using gradient
descent.


A different line of research focuses on integrating ker- 
nel methods into GNNs. mapped inputs to 
RKHS by comparing inputs with reference objects. How- 
ever, the reference objects they use lack graph structure and 
may not be able to capture the structural information.
Chen, Jacob, and Mairal (2020) proposed GCKN, which maps
the input into a subspace of RKHS using walk and path
kernels. While GCKN utilizes only the local walks and paths
starting from the central node, our model considers any
walks (up to a maximal length) within the subgraph around
the central node, and can thus explore more topological
structures. Another recent work by Nikolentzos and
Vazirgiannis (2020) focused on improving model transparency
by calculating the graph kernels between trainable hidden
graphs and the entire graph. However, the method only
supports a single-layer model and lacks theoretical
interpretation. Our KerGNN model generalizes their scenario
by applying hidden graphs to extract local structural
information instead of the entire graph, and therefore
constructs a multi-layer structure with better graph
classification performance.


Explainability 

Both graph structures and feature information lead to
complex GNN models, making it hard to give a
human-intelligible explanation of the prediction results.
Therefore, the transparency and explainability of GNN
models are important issues to address. Baldassarre and
Azizpour (2019) compared two main classes of explainability
methods using infection and solubility problems. Pope et
al. (2019) introduced explainability methods for the
popular graph convolutional neural networks and
demonstrated the extended methods on visual scene graphs
and molecular graphs. Ying et al. (2019) proposed a
model-agnostic approach that can identify a compact
subgraph that has a crucial role in a GNN's prediction. In
addition to visualizing output graphs as in regular GNNs,
our KerGNN provides trained hidden graphs as a byproduct of
training without additional computations. These hidden
graphs contain useful structural information showing the
common characteristics of the whole dataset instead of one
specific graph, and can be helpful for interpreting the
predictions of GNNs.


Background: Graph Kernels 


Graph kernels have been proposed to assess the similarity
between graphs, thereby making it possible to perform
classification and regression with graph-structured data.
Most graph kernels can be written as the sum of several
pair-wise base kernels, following the R-convolution
framework (Haussler 1999):

$$K(G, G') = \sum_{v \in V} \sum_{v' \in V'} k_{\mathrm{base}}(v, v'), \tag{1}$$

where $G = (V, E)$ and $G' = (V', E')$ are two input graphs
with node attributes, and $k_{\mathrm{base}}$ can be any
positive definite kernel defined on the node attributes. In
this paper, we mainly consider the random walk kernel,
which will be integrated into our proposed model in the
next section.
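As an illustration of Equation 1, the following sketch sums a base kernel over all node pairs of two graphs. The function name `r_convolution_kernel` and the Dirac (label-equality) base kernel are assumptions chosen for illustration; any positive definite kernel on node attributes could take its place.

```python
def dirac(a, b):
    """Base kernel on discrete node labels: 1 if the labels match, else 0."""
    return 1.0 if a == b else 0.0

def r_convolution_kernel(labels_g, labels_h, k_base=dirac):
    """K(G, G') = sum over all node pairs (v, v') of k_base(v, v')."""
    return sum(k_base(a, b) for a in labels_g for b in labels_h)

# Two toy graphs described only by their node labels.
g_labels = ["C", "C", "O"]
h_labels = ["C", "O", "O"]
value = r_convolution_kernel(g_labels, h_labels)  # 2 C-C pairs + 2 O-O pairs
```

This base kernel ignores edges entirely; structure enters only through richer base kernels such as the random walk kernel.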

Random walk kernels are among the most studied graph
kernels. They count the number of walks that two graphs
have in common, and were initially proposed by Gärtner,
Flach, and Wrobel (2003) and Kashima, Tsuda, and Inokuchi
(2003). Among numerous variations of the random walk
kernel, we deploy the P-step random walk kernel, which
compares random walks up to length P in two graphs.

Following Equation 1, we can write the base kernel of
random walks with length $p$ as

$$k_{\mathrm{base}}^{(p)}(v, v') = \begin{cases} \lambda_0, & \text{if } p = 0 \\ \dfrac{\lambda_p}{\lambda_{p-1}} \sum_{u \in N(v)} \sum_{u' \in N(v')} k_{\mathrm{base}}^{(p-1)}(u, u'), & \text{if } p > 0 \end{cases}$$

where $\lambda_p$ is the coefficient, $N(v)$ denotes the
neighbors of $v$, and $p$ denotes the length of the random
walks which we compare in two graphs. If $p = 0$, the
random walk kernel is equivalent to the simple node-pair
kernel. To efficiently compute the random walk kernels, we
follow the generalized framework of computing walk-based
kernels (Vishwanathan et al. 2006), and utilize the direct
product graph defined below.


Definition 1 (Direct Product Graph). For two labeled
graphs $G = (V, E)$ and $G' = (V', E')$, the direct product
graph is defined as $G_\times = G \times G' = (V_\times, E_\times)$,
with $V_\times = \{(v, v') : v \in V \wedge v' \in V'\}$ and
$E_\times = \{\{(v, v'), (u, u')\} : \{v, u\} \in E \wedge \{v', u'\} \in E'\}$.

Performing a random walk on the direct product graph
$G_\times$ is equivalent to performing simultaneous random
walks on graphs $G$ and $G'$. The P-step random walk kernel
can be calculated as

$$K(G, G') = \sum_{p=0}^{P} K_p(G, G') = \sum_{p=0}^{P} \lambda_p \sum_{i,j=1}^{|V_\times|} \left[A_\times^p\right]_{ij}, \tag{2}$$

where $A_\times$ is the adjacency matrix of $G_\times$ and
$\lambda = (\lambda_0, \lambda_1, \dots)$ is a sequence of
weights. It should be noted that the $(i, j)$-th element of
$A_\times^p$ (i.e., $A_\times$ to the power of $p$)
represents the number of common walks of length $p$ between
the $i$-th and $j$-th node in $G_\times$.
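Equation 2 can be evaluated directly for small unlabeled graphs. The sketch below relies on the standard fact that the adjacency matrix of the direct product graph is the Kronecker product of the two adjacency matrices; the function name `random_walk_kernel` and the choice of weights are assumptions made for this example.

```python
import numpy as np

def random_walk_kernel(A1, A2, lambdas):
    """P-step random walk kernel for unlabeled graphs, P = len(lambdas) - 1."""
    Ax = np.kron(A1, A2)            # adjacency matrix of the direct product graph
    power = np.eye(Ax.shape[0])     # A_x^0 is the identity
    total = 0.0
    for lam in lambdas:
        total += lam * power.sum()  # lambda_p * sum_{i,j} [A_x^p]_{ij}
        power = power @ Ax          # move on to walks that are one step longer
    return total

edge = np.array([[0.0, 1.0], [1.0, 0.0]])
triangle = np.ones((3, 3)) - np.eye(3)

k_edge_edge = random_walk_kernel(edge, edge, lambdas=[1.0, 1.0, 1.0])
k_edge_tri = random_walk_kernel(edge, triangle, lambdas=[1.0, 1.0])
```

Forming the Kronecker product costs memory quadratic in the product-graph size, so this direct form is only practical for small graphs such as 1-hop subgraphs and graph filters.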

To generalize the above formula to the continuous and
multi-dimensional scenario, we first define the vertex
attributes of the direct product graph $G_\times$. Given
the node attribute matrix $X \in \mathbb{R}^{n \times d}$
for a graph with $n$ nodes, where each node attribute has
dimension $d$, the node attribute matrix $S$ of the direct
product graph $G_\times = G_1 \times G_2$ is calculated as
$S = X_1 X_2^\top$, where $X_1 \in \mathbb{R}^{n_1 \times d}$
and $X_2 \in \mathbb{R}^{n_2 \times d}$ are the node
attribute matrices for $G_1$ and $G_2$, respectively, and
$S \in \mathbb{R}^{n_1 \times n_2}$. The $(i, j)$-th
element of matrix $S$ encodes the similarity between the
$i$-th node of $G_1$ and the $j$-th node of $G_2$. We
flatten $S$ into a vector $s \in \mathbb{R}^{n_1 n_2}$ for
ease of notation, and then integrate the encoded pair-wise
similarity into Equation 2:

$$K_p(G, G') = \sum_{i,j=1}^{|V_\times|} s_i s_j \left[A_\times^p\right]_{ij} = s^\top A_\times^p s. \tag{3}$$

Based on this equation, we can calculate the kernel value
between two input graphs using the similarity of common
walks as the metric. The details of calculating Equation 3
are included in the Appendix.
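For illustration, Equation 3 can be computed in two equivalent ways. The sketch below first follows the formula literally and then uses a standard Kronecker-product identity to avoid building the product graph; this shortcut is an assumption for the example and is not necessarily the computation described in the paper's Appendix. The function names are hypothetical.

```python
import numpy as np

def rw_kernel_p_naive(A1, X1, A2, X2, p):
    """s^T A_x^p s with the product-graph adjacency built explicitly."""
    s = (X1 @ X2.T).reshape(-1)              # flatten S row by row
    Ax = np.kron(A1, A2)
    return float(s @ np.linalg.matrix_power(Ax, p) @ s)

def rw_kernel_p_fast(A1, X1, A2, X2, p):
    """Same value without the Kronecker product.

    Uses (A1 kron A2)^p = A1^p kron A2^p and, for row-major flattening,
    (A kron B) vec(S) = vec(A S B^T).
    """
    S = X1 @ X2.T
    M = np.linalg.matrix_power(A1, p) @ S @ np.linalg.matrix_power(A2, p).T
    return float((S * M).sum())

# Small example: a 3-node path with 2-d attributes versus a triangle.
A_path = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
A_tri = np.ones((3, 3)) - np.eye(3)
X_path = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_tri = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

naive = rw_kernel_p_naive(A_path, X_path, A_tri, X_tri, p=2)
fast = rw_kernel_p_fast(A_path, X_path, A_tri, X_tri, p=2)
```

The second form only multiplies matrices of the original graph sizes, which is what makes kernel evaluation between a subgraph and a small graph filter cheap.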

In practice, we also consider a slight variation of
Equation 1 by adding trainable weights to each base kernel
term, and we call it the deep random walk kernel:

$$K(G, G') = \sum_{v \in V} \sum_{v' \in V'} w_{(v, v')} \, k_{\mathrm{base}}(v, v'), \tag{4}$$

where $w_{(v, v')}$ represents the trainable weight
assigned to the base kernel.


Proposed Model 


In this section, we first discuss the framework of the pro- 
posed KerGNN model. Then we introduce the concept of 
subgraph-based neighborhood aggregation, and use it to an- 
alyze the expressivity of KerGNNs. Next, we show that 
KerGNNs are inspired by CNNs and compare them from the 
kernel perspective. Finally we argue that KerGNNs can gen- 
eralize MPNN architecture and analyze the time complexity. 


KerGNN Framework 


In this subsection, we introduce the KerGNN model which 
updates each node’s embedding according to the subgraph 
centered at this node instead of the rooted subtree patterns 
in MPNNs, as shown in Figure 1(b). Unless otherwise
specified, we refer to the subgraph as the vertex-induced subgraph
formed from a node and all its 1-hop neighbors. 

We first define the embeddings of nodes and subgraphs, 
which are mapping functions from graphs to the feature 
space and from nodes to the feature space. 

Definition 2 (Feature mapping). Given a graph $G = (V, E)$,
a node feature mapping is a node-wise mapping function
$\phi : V \to \mathbb{R}^d$, which maps every node
$v \in V$ to a point $\phi(v)$ in $\mathbb{R}^d$, and
$\phi(v)$ is called the feature map for node $v$. A graph
feature mapping is a function $\Phi : \mathcal{G} \to \mathbb{R}^d$,
where $\mathcal{G}$ is the set of graphs, and $\Phi(G)$ is
called the feature map for graph $G \in \mathcal{G}$.

For an $L$-layer neural network, we call the input layer
the 0-th layer. At each hidden layer $l$, the input to this
layer is an undirected graph $G = (V, E)$, and each node
$v \in V$ has a feature map $\phi_{l-1}(v) \in \mathbb{R}^{d_{l-1}}$.
The output of layer $l$ is the same graph $G$, because we
do not consider graph pooling here, and each node $v \in V$
in the output graph has a feature map
$\phi_l(v) \in \mathbb{R}^{d_l}$. For example, $G$ can be a
graph in the dataset, and $\phi_0(v)$ is the node
attributes with dimension $d_0$. For graphs with discrete
node labels, the attributes can
be represented as the one-hot encodings of labels and the 
dimension of attributes corresponds to the total number of 
classes. For graphs without node labels, we use the degree 
of the node as the node attribute. 

Inspired by the filters in CNNs, we define a set of graph
filters at each KerGNN layer to extract the local
structural information around each node in the input graph
(see Figure 1(b)).

Definition 3 (Graph filter). The $i$-th graph filter at
layer $l$ is a graph $H_i^{(l)}$ with $n_l$ nodes. It has a
trainable adjacency matrix
$A_i^{(l)} \in \mathbb{R}^{n_l \times n_l}$ and node
attribute matrix $W_i^{(l)} \in \mathbb{R}^{n_l \times d_{l-1}}$.

At layer $l$, there are $d_l$ graph filters such that the
output dimension is also $d_l$, and each node attribute in
the graph filter, represented by each row of $W_i^{(l)}$,
has the same dimension as the node feature map
$\phi_{l-1}(v)$ in the input graph.


KerGNN Layer. Now we consider a single KerGNN layer. We
assume the input is a graph-structured dataset with
undirected graph $G = (V, E)$, and each node $v \in V$ has
the attribute $a(v) \in \mathbb{R}^{d_0}$. Then the input
node feature map is $\phi_0(v) = a(v)$.

Each node $v$ in the graph is equipped with a subgraph
$G_v = (V_v, E_v)$, and the feature maps
$\{\phi_0(u) : u \in V_v\}$ are transformed to $\phi_1(v)$
in such a way that the neighbors' local information
(topological information and node representations)
contained in $G_v$ is aggregated to the central node $v$.
We then rely on the graph filters
$\{H_i^{(1)} : i = 1, \dots, d_1\}$ to obtain $\phi_1(v)$.
Specifically, we calculate $\phi_1(v)$ by projecting the
subgraph feature map $\Phi_0(G_v)$ into the $i$-th
dimension of $\phi_1(v)$ using the kernel function value
between graph filter $H_i^{(1)}$ and subgraph $G_v$, i.e.,

$$\phi_{1,i}(v) = K(G_v, H_i^{(1)}), \tag{5}$$

where we adopt a random walk kernel as $K(\cdot, \cdot)$,
which is introduced in Equation 2. After calculating the
kernel value of the subgraph $G_v$ with respect to every
graph filter $\{H_i^{(1)} : i = 1, \dots, d_1\}$, we obtain
every dimension of node $v$'s feature map $\phi_1(v)$,
which forms the output of the KerGNN layer.


It should be noted that using graphs $G_v$ and $H_i^{(1)}$
to calculate the kernel value is equivalent to performing
the inner product of $\Phi_0(G_v)$ and $\Phi_0(H_i^{(1)})$
in an implicit high-dimensional space, and using the
feature map of $G_v$ instead of the multiset of neighboring
nodes (as used in MPNNs) improves expressivity, which is
analyzed in the next subsection. Besides, if we use the
output space $\mathbb{R}^d$ to approximate the
high-dimensional space introduced by the kernel method, the
updating rule will correspond to the convolutional kernel
network proposed by Mairal et al. (2014), and we will
follow the same idea when we compare KerGNNs with CNNs in
the later subsection.


Multiple-layer Model. Based on the single-layer analysis
above, we can construct a multiple-layer KerGNN by stacking
KerGNN layers followed by readout layers. Specifically, the
input to layer $l$ is the graph $G$ with node feature map
$\{\phi_{l-1}(v) : v \in G\}$. Layer $l$ is parameterized
by $d_l$ graph filters $\{H_i^{(l)} : i = 1, \dots, d_l\}$.
Each graph filter $H_i^{(l)}$ has a trainable adjacency
matrix $A_i^{(l)}$ and node attributes $W_i^{(l)}$. Then
the $i$-th dimension of the output feature map for node $v$
in $G$ can be explicitly calculated as

$$\phi_{l,i}(v) = K(G_v, H_i^{(l)}). \tag{6}$$

The forward pass of the $l$-th layer of KerGNNs is
summarized in Algorithm 1.
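The layer update in Equation 6 can be sketched in NumPy as follows. This is a forward pass only, under several assumptions that are not taken from the paper: the graph filters are fixed arrays rather than trained parameters, all walk weights are set to 1, and the kernel is evaluated with a Kronecker-product identity instead of the Appendix procedure. The names `rw_kernel`, `kergnn_layer`, and `adjacency` are hypothetical.

```python
import numpy as np

def rw_kernel(A1, X1, A2, X2, P):
    """Sum over p = 0..P of s^T A_x^p s (Equation 3), without forming A_x."""
    S = X1 @ X2.T
    total, L, R = 0.0, np.eye(len(A1)), np.eye(len(A2))
    for _ in range(P + 1):
        total += float((S * (L @ S @ R.T)).sum())
        L, R = L @ A1, R @ A2      # next powers of the two adjacency matrices
    return total

def kergnn_layer(A, X, filters, P=2):
    """One forward pass: out[v, i] = K(G_v, H_i) for every node and filter."""
    n = A.shape[0]
    out = np.zeros((n, len(filters)))
    for v in range(n):
        # 1-hop vertex-induced subgraph around v.
        nodes = [v] + [u for u in range(n) if u != v and A[v, u] != 0]
        A_v, X_v = A[np.ix_(nodes, nodes)], X[nodes]
        for i, (A_h, W_h) in enumerate(filters):
            out[v, i] = rw_kernel(A_v, X_v, A_h, W_h, P)
    return out

def adjacency(n, edges):
    A = np.zeros((n, n))
    for a, b in edges:
        A[a, b] = A[b, a] = 1.0
    return A

hexagon = adjacency(6, [(i, (i + 1) % 6) for i in range(6)])
two_triangles = adjacency(6, [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)])
triangle_filter = (adjacency(3, [(0, 1), (1, 2), (0, 2)]), np.ones((3, 1)))
X = np.ones((6, 1))                # constant node attributes

out_hex = kergnn_layer(hexagon, X, [triangle_filter], P=2)
out_tri = kergnn_layer(two_triangles, X, [triangle_filter], P=2)
```

With a single triangle-shaped filter and constant attributes, the nodes of the hexagon and of the two triangles already receive different embeddings, although the 1-WL test cannot separate the two graphs.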

For graph classification, we then deploy the graph-level
readout layer to generate the embedding for the entire
graph. We obtain the graph representation at each layer by
summing all the nodes' representations. To leverage
information from every layer of the model, we then
concatenate the graph representations across all layers:

$$\Phi(G) = \mathrm{concat}\Big( \sum_{v \in G} \phi_l(v) \;\Big|\; l = 0, 1, \dots, L \Big). \tag{7}$$
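The readout in Equation 7 amounts to a sum over nodes followed by a concatenation over layers. A minimal sketch, assuming the per-layer node features are available as a list of NumPy arrays (the name `graph_readout` is illustrative):

```python
import numpy as np

def graph_readout(layer_features):
    """Sum node features within each layer, then concatenate across layers.

    layer_features: list of arrays, one per layer, each of shape (n_nodes, d_l).
    """
    return np.concatenate([phi.sum(axis=0) for phi in layer_features])

# A graph with 4 nodes, a 2-d input layer and a 3-d hidden layer.
phi_0 = np.ones((4, 2))
phi_1 = 2.0 * np.ones((4, 3))
embedding = graph_readout([phi_0, phi_1])   # length 2 + 3 = 5
```

Because the per-layer sum does not depend on the node order, the resulting graph embedding is permutation invariant.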


Expressivity of Subgraph-based Aggregation 


In this subsection, we first define the subgraph-based
neighborhood aggregation and discuss the requirements that
the subgraph feature map must meet to achieve higher
expressivity than the 1-WL algorithm. We then show that
KerGNN is one of the models that satisfy these
requirements.

To leverage the structural information contained in the
subgraph, we aggregate the subgraph information by finding
a proper subgraph feature map $\Phi(G_v)$, and update the
node representation of $v$ by combining the subgraph
feature map with $v$'s own feature map. Formally, we define
this aggregation process as follows.


Algorithm 1: Forward pass in the $l$-th KerGNN layer

Input: Graph $G = (V, E)$; input node feature maps
    $\{\phi_{l-1}(v) : v \in V\}$; graph filters
    $\{H_i^{(l)} : i = 1, \dots, d_l\}$; graph kernel function $K$
Output: Graph $G = (V, E)$; output node feature maps
    $\{\phi_l(v) : v \in V\}$

for $v \in V$ do
    $G_v$ = subgraph($\{v\} \cup N(v)$);
    for $i = 1$ to $d_l$ do
        $\phi_{l,i}(v) = K(G_v, H_i^{(l)})$;
    end for
end for


Definition 4 (Subgraph-based aggregation). The graph neural
network at layer $l$ deploying subgraph-based neighborhood
aggregation updates the feature mapping $\phi$ according to
$\phi_l(v) = u(\phi_{l-1}(v), f(\Phi_{l-1}(G_v)))$, where
$u$ and $f$ are update and aggregation functions,
respectively.

GNNs distinguish different graphs by mapping them to
different embeddings, which resembles the graph isomorphism
test. Xu et al. (2018a) characterize the representational
capacity of MPNNs using the WL graph isomorphism test
criterion, and show that MPNNs can be as powerful as the
1-WL graph isomorphism test if the node update,
aggregation, and graph-level readout functions are
injective. We follow a similar approach and show in the
following that subgraph-based GNNs like KerGNNs can be at
least as powerful as the 1-WL graph isomorphism test.

Because we are comparing the model's expressivity with the
1-WL algorithm, which updates node labels based on the
multiset of neighboring nodes, to achieve high expressivity
it is natural to require that $\Phi(G_v)$ has a one-to-one
relationship with the multiset of nodes that subgraph $G_v$
contains. We show in Lemma 1 that the graph feature map
induced by the random walk kernel satisfies this condition.


Lemma 1. If $\Phi(G)$ is the feature map of graph $G$
induced by the random walk graph kernel, then $\Phi(G)$ is
injective with respect to the multiset of all its contained
nodes $\{\{a(v) : v \in V(G)\}\}$, where $\{\{\cdot\}\}$
denotes the multiset and $a(v)$ is the label or attribute
of node $v$.


The proof follows directly from the random walk kernel
definition in Gärtner, Flach, and Wrobel (2003), and we
note that the graph feature map induced by the WL graph
kernel also satisfies this lemma. Based on this injective
relationship between the multiset and the subgraph feature
map, we can compare the expressivity of the subgraph-based
GNN and the 1-WL graph isomorphism test using the following
theorem.


Theorem 1. Let $\mathcal{A} : \mathcal{G} \to \mathbb{R}^d$
be a GNN with a sufficient number of GNN layers. If the
following conditions hold at layer $l$:

a) $\mathcal{A}$ aggregates and updates node features
iteratively with
$\phi_l(v) = u(\phi_{l-1}(v), f(\Phi_{l-1}(G_v)))$, where
the functions $u$ and $f$ are injective, and $\Phi_{l-1}$
is the feature mapping induced by the random walk kernel;

b) $\mathcal{A}$'s graph-level readout, which operates on
the multiset of node features $\{\{\phi_l(v)\}\}$, is
injective;

then $\mathcal{A}$ maps any graphs $G$ and $H$ that the
1-WL test decides as non-isomorphic to different
embeddings, and there exist graphs $G$ and $H$ that the
1-WL test decides as isomorphic but that can be mapped to
different embeddings by $\mathcal{A}$.


The proof is shown in the Appendix. This theorem shows that
subgraph-based GNNs can be more expressive than the 1-WL
isomorphism test and thus MPNNs. In the KerGNN model, we do
not explicitly calculate the subgraph feature map
$\Phi(G_v)$, which lives in the high-dimensional space.
Instead, we apply the kernel trick and use the subgraph
feature map as
$K(G_v, H) = \langle \Phi(G_v), \Phi(H) \rangle$. Then, the
graph kernel function $K(\cdot, H)$ can be seen as a
composition of the functions $u$ and $f$. Therefore,
according to Theorem 1, to achieve high representational
power, the graph kernel function needs to be injective with
respect to the subgraph feature map $\Phi(G_v)$, and we
introduce the following lemma to show that the KerGNN model
satisfies this requirement.

Lemma 2. There exists a feature map $\Phi(H)$ so that
$K(H, G_v) = \langle \Phi(H), \Phi(G_v) \rangle$ is unique
for different $\Phi(G_v)$.

The proof is shown in the Appendix. Besides, as shown in
the definition of graph filters, in the KerGNN model we
parameterize the node features and adjacency matrix of the
graph filter $H$ instead of directly parameterizing
$\Phi(H)$.


Connections to CNNs 


Standard CNN models update the representation of each 
pixel by convolving filters with the patch centered at it, and 
in GNNs, a natural analog of the patch in the graph domain 
is the subgraph. While many MPNNs draw connections with 
CNNs by extending 2-D convolution to the graph convolu- 
tion, we show in this subsection that both 2-D image con- 
volution and KerGNN aggregation process can be viewed 
as applying kernel tricks to the input image or graph, and 
therefore, the KerGNN model naturally extends the CNN 
architecture into the graph domain, from a new kernel per- 
spective. 

We first show in Appendix that under suitable assump- 
tions, the 2-D image convolution can be viewed as applying 
kernel functions between input patches and filters. The basic 
idea is that we can rethink the 2-D convolution as projecting 
the input image patch into the kernel-induced Hilbert space. 
The projection is done by performing inner product between 
the patch and basis vectors, which can be calculated using 
the kernel trick, and the projected representation in the out- 
put space will be the output of the CNN layer. 
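As a rough illustration of this view, the sketch below writes a 2-D convolution (in the cross-correlation form used by CNNs) as a kernel evaluation between every image patch and every filter. Using a plain linear kernel is an assumption made here for simplicity; the assumptions and the kernel used in the paper's Appendix may differ. The name `conv2d_as_kernel` is hypothetical.

```python
import numpy as np

def linear_kernel(patch, filt):
    """Inner product of the flattened patch and filter."""
    return float(np.vdot(patch, filt))

def conv2d_as_kernel(image, filters, kernel=linear_kernel):
    """Output[c, i, j] = kernel(patch centered near (i, j), filter c)."""
    k = filters[0].shape[0]
    H, W = image.shape
    out = np.zeros((len(filters), H - k + 1, W - k + 1))
    for c, filt in enumerate(filters):
        for i in range(H - k + 1):
            for j in range(W - k + 1):
                patch = image[i:i + k, j:j + k]
                out[c, i, j] = kernel(patch, filt)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
filters = [np.ones((2, 2))]
feature_map = conv2d_as_kernel(image, filters)
```

Swapping `linear_kernel` for a kernel between graphs, and patches for subgraphs, gives the structure of the KerGNN update.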

Then we can extend the same philosophy to the graph do- 
main, by introducing subgraphs and topology-aware graph 
filters as the counterpart of patches and filters in CNNs, and 
KerGNN will adopt the kernel trick to project the input
subgraph representation into the output space (detailed in
the Appendix). Based on these two observations, we can see
that KerGNNs generalize CNNs into the graph domain by
replacing the kernel function for vectors with the graph
kernel function, which provides a new insight into
designing GNN architectures, different from the spatial and
spectral convolution perspectives.


Connections to Existing GNNs 


As the subgraph of a node can be a richer source of
information than just the multiset of its neighbors, we
show in this subsection that KerGNNs can generalize the
standard MPNNs. From the point of view of KerGNNs, MPNNs
deploy a simple graph filter with one node, and an
appropriate kernel function can be chosen within the KerGNN
framework such that KerGNNs iteratively update nodes'
representations using neighborhood multiset aggregation, as
in MPNNs. For example, we show in the Appendix that the
node update rule of the Graph Convolutional Network (GCN)
(Kipf and Welling 2016) can be treated as using one-node
graph filters with a properly defined R-convolution graph
kernel. Our model generalizes most MPNN structures by
deploying more complex graph filters with multiple nodes
and a learnable adjacency matrix, and by using more
expressive and efficient graph kernels.


Time Complexity Analysis 


Most MPNNs incur a time complexity of $O(n^2)$, or $O(m)$
if the adjacency matrix is sparse with $m$ non-zero
entries, because updating the embedding of node $v$
involves its $n_v$ neighbors, where $n_v$ is the degree of
node $v$. In KerGNNs, we apply the graph kernel to the
subgraph $G_v$ instead of the whole graph, so the
computational complexity depends on the complexity of each
subgraph. For the subgraph $G_v$ with $n_v + 1$ nodes and
an adjacency matrix with $m_v$ non-zero entries, we update
the representation of node $v$ by calculating the random
walk kernel with Equation 1 in the Appendix. This
calculation takes a computation time of
$O(P d (d' n_{gf} (n_{gf} + m_v + 1) + m_v))$, where $P$ is
the maximum length of the random walk, $d$ and $d'$ are the
node dimensions of the current layer and next layer, and
$n_{gf}$ is the number of nodes in each graph filter. In an
undirected subgraph, $m_v$ represents the number of edges
and will be greater than $n_v$ and smaller than
$n_v(n_v - 1)/2$. If we sum up the computation time for all
the nodes in the entire graph, the time complexity of
KerGNNs will range between $O(n^2)$ and, in the worst-case
scenario (a fully-connected graph), $O(n^3)$. We
experimentally compare the running time of the proposed
model with several GNN benchmarks. As shown in the
running-time comparison in the Appendix, KerGNNs achieve
better or similar running time compared to the fastest
benchmark method, and much less running time than
higher-order GNNs.


Experiments 
We evaluate the proposed model on graph classification and
node classification tasks (the latter discussed in the
Appendix), and we also show the model interpretability by
visualizing the graph filters in the trained models as well
as the output graphs.


Experiment Settings 


Datasets. We evaluate our proposed KerGNN model on 8
publicly available graph classification datasets.
Specifically, we use DD (Dobson and Doig 2003), PROTEINS
(Borgwardt et al. 2005), NCI1 (Schomburg et al. 2004), and
ENZYMES (Schomburg et al. 2004) for binary and multi-class
classification of biological and chemical compounds, and we
also use the social datasets IMDB-BINARY, IMDB-MULTI,
REDDIT-BINARY, and COLLAB (Yanardag and Vishwanathan 2015).


Setup. To make a fair comparison with state-of-the-art
GNNs, we follow the cross-validation procedure described in
Errica et al. (2019). We use 10-fold cross-validation for
model assessment and an inner holdout technique with a
90%/10% training/validation split for model selection,
following the same dataset index splits as Errica et al.
(2019). Besides, we use the Adam optimizer with an initial
learning rate of 0.01 and halve the learning rate every 50
epochs. For the four social datasets, we use node degrees
as the input attributes for each node, and for the four
bio/chemical datasets, we use node labels or attributes as
the input feature for each node.
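The evaluation protocol and the learning-rate schedule can be sketched as follows. This is only an illustration: the experiments reuse the fixed index splits of Errica et al. (2019), whereas the sketch draws its own unstratified random splits, and the names `learning_rate` and `cv_splits` are hypothetical.

```python
import random

def learning_rate(epoch, initial_lr=0.01, step=50, factor=0.5):
    """Step decay: multiply the rate by `factor` every `step` epochs."""
    return initial_lr * factor ** (epoch // step)

def cv_splits(n, folds=10, val_fraction=0.1, seed=0):
    """Yield (train, val, test) index lists.

    Outer loop: `folds`-fold cross-validation for model assessment.
    Inner split: holdout of `val_fraction` of the remaining data for
    model selection.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    for k in range(folds):
        test = idx[k::folds]                  # every folds-th index, offset k
        test_set = set(test)
        rest = [i for i in idx if i not in test_set]
        n_val = max(1, round(val_fraction * len(rest)))
        yield rest[n_val:], rest[:n_val], test

splits = list(cv_splits(100))
```

Each of the ten outer folds serves once as the test set, while hyper-parameters are chosen on the inner validation split only.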


Hyper-parameters. The hyper-parameters that we tune 
for each dataset include the learning rate, the dropout rate, 
the number of layers of KerGNNs and MLP, the number of 
graph filters at each layer, the number of nodes in each graph 
filter, the number of nodes for each subgraph, and the hid- 
den dimension of each KerGNN layer. For the random walk 
kernel, we also tune the length of random walks. 


Baseline Models. We consider the KerGNN model with single
and multiple KerGNN layers, namely KerGNN-L, corresponding
to the KerGNN model with L layers, and KerGNN-L-DRW,
representing the model deploying the deep random walk
kernel. We also compare our models with the baseline GNNs
reported in Table 1, including the high-order GNNs 1-2-3
GNN (Morris et al. 2019) and Powerful GNN (Maron et al.
2019). Part of the results for these baseline GNNs are
taken from Errica et al. (2019), and we run GCKN, 1-2-3 GNN
and Powerful GNN using the official implementations. In
addition, we also compare the proposed KerGNN model with
three popular GNN-unrelated graph kernels, namely the
shortest path (SP) kernel (Borgwardt and Kriegel 2005), the
propagation (PK) kernel (Neumann et al. 2016), and the
Weisfeiler-Lehman subtree (WL-sub) kernel (Shervashidze et
al. 2011), and with the GNN-related GNTK (Du et al. 2019).
We use the GraKeL library (Siglidis et al.) to implement
these graph kernels and run GNTK using the official
implementation.


Results 


The graph classification results are shown in Table 1, with the best results highlighted in bold. We can see that the
proposed models achieve better performance than conventional GNNs with 1-WL limits, and achieve performance similar to
that of high-order GNNs with less running time. The single-layer KerGNN model performs well on small graphs such as the
IMDB social datasets. For larger graphs, deeper models with more layers or with the deep random walk




Table 1: Test set classification accuracies (%). The mean accuracy and standard deviation are reported. Best performances are
highlighted in bold. OOR means Out of Resources, either time or GPU memory.

               DD         NCI1       PROTEINS   ENZYMES    IMDB-B     IMDB-M     REDDIT-B   COLLAB
# GRAPHS       1178       4110       1113       600        1000       1500       2000       5000
# CLASSES      2          2          2          6          2          3          2          3
AVG. # NODES   284        30         39         33         20         13         430        74

SP             78.7±3.8   66.3±2.6   71.9±6.1   25.0±5.6   57.5±5.4   40.5±2.8   75.5±2.1   58.4±1.3
PK             78.0±3.8   72.3±2.8   59.7±0.3   61.0±6.7   73.9±4.3   51.1±5.8   68.5±2.9   77.3±2.4
WL-SUB         77.5±3.5   79.5±3.3   74.8±3.2   51.2±5.3   72.5±4.6   51.5±5.8   67.2±4.2   77.5±2.4
GNTK           OOR        83.5±1.2   75.5±2.2   48.2±2.4   75.9±3.1   52.2±4.2   OOR        OOR

DGCNN          76.6±4.3   76.4±1.7   72.9±3.5   38.9±5.7   69.2±3.0   45.6±3.4   87.8±2.5   71.2±1.9
DIFFPOOL       75.0±3.5   76.9±1.9   73.7±3.5   59.5±5.6   68.4±3.3   45.6±3.4   89.1±1.6   68.9±2.0
ECC            72.6±4.1   76.2±1.4   72.3±3.4   29.5±8.2   67.7±2.8   43.5±3.1   OOR        OOR
GIN            75.3±2.9   80.0±1.4   73.3±4.0   59.6±4.5   71.2±3.9   48.5±3.3   89.9±1.9   75.6±2.3
GRAPHSAGE      72.9±2.0   76.0±1.8   73.0±4.5   58.2±6.0   68.8±4.5   47.6±3.5   84.3±1.9   73.9±1.7
RWGNN          77.6±4.7   73.9±1.3   74.7±3.3   57.6±6.3   70.8±4.8   48.8±2.9   90.4±1.9   71.9±2.5
GCKN           77.3±4.0   79.2±1.2   76.1±2.8   59.3±5.6   74.5±1.2   51.0±3.9   OOR        74.3±2.8
1-2-3 GNN      OOR        72.7±2.9   74.5±5.6   OOR        70.7±3.4   50.2±2.2   91.1±2.1   OOR
POWERFUL GNN   OOR        83.4±1.8   75.9±3.3   54.8±5.5   73.0±4.9   50.5±3.2   OOR        75.4±1.4

KERGNN-1       77.6±3.7   74.3±2.2   75.8±3.5   62.1±5.5   74.4±4.3   51.6±3.1   81.5±1.9   70.5±1.6
KERGNN-2       78.9±3.5   76.3±2.6   75.5±4.6   55.0±5.0   73.7±4.0   50.9±5.1   82.0±2.5   72.7±2.1
KERGNN-3       75.5±3.1   80.5±1.9   76.5±3.9   54.1±4.3   72.1±4.6   50.1±4.5   82.0±1.9   71.1±2.0
KERGNN-2-DRW   77.0±4.4   82.8±1.8   76.1±4.1   59.5±4.5   71.1±4.1   50.5±3.1   89.5±1.6   75.1±2.3




Figure 2: Model visualization. Input graphs are drawn from (a) MUTAG and (b) REDDIT-B datasets with different node
shapes corresponding to different atom types. In both graph filters and output graphs, node color represents relative
attribute value.


kernel perform better. We show more experimental results, model parameter studies, and node classification results in
the Appendix. The optimal parameters of the graph filter differ across datasets, depending on the local structures of
different types of graphs, e.g., the star patterns in graphs of REDDIT-B and the ring and chain patterns in graphs of NCI1.


Model Interpretability 


Visualizing the filters in CNNs gives insights into what features CNNs focus on. Following the same idea, we can also
visualize the trained graph filters, which indicate some key structures of the input dataset. We visualize the graph filters
trained with the MUTAG² (Kersting et al. 2016) and REDDIT-B datasets in Figure 2. The MUTAG dataset consists of 188


² We use the MUTAG dataset for visualization due to its easily interpretable structure. However, we do not use this
dataset in cross-validation because its number of graphs is too small.


chemical compounds divided into two classes according to their mutagenic effect on a bacterium. As shown in the input
graphs in Figure 2(a), most of the MUTAG graphs in the dataset consist of ring structures with 6 carbon atoms.


KerGNNs and other standard GNNs can generate output graphs with updated node attributes, and we can extract important
nodes for the classification tasks using relative attribute values, as shown in the output graphs in Figure 2. We can
make several observations from the output MUTAG chemical structures: 1) The carbon atoms at the connection points of
rings are more important than those connected with atom groups, which are in turn more important than those at the
remaining positions. 2) The atoms in the atom groups are always less important than the carbon atoms in the carbon ring.

Compared to standard GNN variants, KerGNNs have graph filters as extra information to help explain the predictions of
the model. To visualize the graph filters, we extract the adjacency matrix and the attribute matrix for each graph filter
from the trained KerGNN layer. We then apply the ReLU function to prune the unimportant edges. In Figure 2, we use
different sizes of nodes to denote the relative importance of nodes. For the MUTAG dataset, we can see that most of the
graph filters have ring structures, similar to the carbon rings in the input graphs, and some graph filters have small
connected rings, similar to the concatenated carbon rings. It should be noted that the number of nodes in the rings of
graph filters may not be equal to 6 because we limit the total number of nodes to 8. KerGNN layers utilize these rings
in the graph filters to match against the local structural patterns (e.g., carbon rings) in the input graphs, using the
graph kernels. This indicates the importance of carbon rings in the mutagenic effect, which also corresponds to our
observations in the output graphs.
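The edge-pruning step used for visualization can be sketched as below. This is a minimal illustration, not the released implementation: the function name, the symmetrization of the learned matrix, and the example values are assumptions made here.

```python
import numpy as np


def prune_filter_edges(adj_param):
    """Keep only positively weighted edges of a learned filter adjacency.

    Assumption: the learned parameter is symmetrized before pruning so that
    the drawn graph is undirected.
    """
    sym = (adj_param + adj_param.T) / 2.0
    np.fill_diagonal(sym, 0.0)          # ignore self-loops when drawing
    weights = np.maximum(sym, 0.0)      # ReLU removes non-positive entries
    n = len(weights)
    edges = [(i, j, weights[i, j])
             for i in range(n) for j in range(i + 1, n)
             if weights[i, j] > 0]
    return weights, edges


# Hypothetical 3-node filter parameter, for illustration only.
adj_param = np.array([[0.0, 0.8, -0.3],
                      [0.6, 0.0, 0.1],
                      [-0.5, 0.3, 0.0]])
weights, edges = prune_filter_edges(adj_param)
```

In this toy example the edge between nodes 0 and 2 has a negative averaged weight and is therefore dropped from the drawing.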


Conclusion 


In this paper, we have proposed Kernel Graph Neural Networks (KerGNNs), a new graph neural network framework that is
not restricted to the theoretical limits of message passing aggregation. KerGNNs are inspired by several characteristics
of CNNs and can be seen as a natural extension of CNNs in the graph domain, from the viewpoint of kernel methods.
KerGNNs achieve competitive performance on a variety of datasets compared with several GNNs and graph kernels, and can
provide improved explainability and transparency by visualizing the graph filters and output graphs.

Calculation of Random Walk Kernel
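To calculate the random walk kernel between two graphs, we have to evaluate the kernel defined in the main paper:

$$K_p(G_1, G_2) = s^\top A_\times^p s, \tag{8}$$

where $A_\times = A_2 \otimes A_1$, $\otimes$ denotes the Kronecker product between two matrices, $s = \mathrm{vec}(S)$, and $\mathrm{vec}(\cdot)$ denotes
flattening a matrix into a vector by stacking all the columns. Besides, $S = X_1 X_2^\top \in \mathbb{R}^{n_1 \times n_2}$. In the following, we assume
$G_1$ and $G_2$ are undirected graphs, which is the case for all the datasets we use in the experiments. According to the
properties of the Kronecker product

$$A_\times^p = A_2^p \otimes A_1^p, \tag{9}$$

and

$$\mathrm{vec}(AXB) = (B^\top \otimes A)\,\mathrm{vec}(X), \tag{10}$$

we can calculate Equation 8 as

$$
\begin{aligned}
K_p(G_1, G_2) &= s^\top A_\times^p s \\
&= s^\top (A_2^p \otimes A_1^p)\, s \\
&= \mathrm{vec}(X_1 X_2^\top)^\top (A_2^p \otimes A_1^p)\, \mathrm{vec}(X_1 X_2^\top) \\
&= \mathrm{vec}(X_1 X_2^\top)^\top \mathrm{vec}\big(A_1^p X_1 X_2^\top (A_2^p)^\top\big) \\
&= \mathrm{vec}(X_1 X_2^\top)^\top \mathrm{vec}\big(A_1^p X_1 (A_2^p X_2)^\top\big) \\
&= \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} \Big[ (X_1 X_2^\top) \odot \big(A_1^p X_1 (A_2^p X_2)^\top\big) \Big]_{ij},
\end{aligned}
\tag{11}
$$

where $\odot$ denotes the Hadamard (element-wise) product. In the KerGNN model, $G_1$ and $G_2$ represent the graph filter and the
input graph, respectively, and we can use Equation 11 to avoid calculating the direct product graph.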
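As a numerical sanity check of the identity above, the following sketch compares the direct evaluation $s^\top A_\times^p s$ with the form that avoids the direct product graph. The function names and the random test graphs are illustrative assumptions, and NumPy is assumed to be available.

```python
import numpy as np


def rw_kernel_direct(A1, X1, A2, X2, p):
    """Evaluate s^T A_x^p s with the explicit Kronecker product A_x = A2 (x) A1."""
    S = X1 @ X2.T
    s = S.flatten(order="F")                     # vec(.): stack the columns
    A_prod = np.kron(A2, A1)
    return s @ np.linalg.matrix_power(A_prod, p) @ s


def rw_kernel_efficient(A1, X1, A2, X2, p):
    """Evaluate the same value without forming the (n1*n2) x (n1*n2) matrix."""
    S = X1 @ X2.T
    A1p = np.linalg.matrix_power(A1, p)
    A2p = np.linalg.matrix_power(A2, p)
    return np.sum(S * (A1p @ X1 @ (A2p @ X2).T))  # Hadamard product, then sum


def random_undirected_graph(n, d, rng):
    """Random symmetric 0/1 adjacency matrix and random node attributes."""
    upper = np.triu(rng.integers(0, 2, size=(n, n)), k=1).astype(float)
    return upper + upper.T, rng.normal(size=(n, d))


rng = np.random.default_rng(0)
A1, X1 = random_undirected_graph(4, 3, rng)       # plays the role of the graph filter
A2, X2 = random_undirected_graph(6, 3, rng)       # plays the role of the input subgraph
pairs = [(rw_kernel_direct(A1, X1, A2, X2, p),
          rw_kernel_efficient(A1, X1, A2, X2, p)) for p in range(4)]
```

The direct version builds an $n_1 n_2 \times n_1 n_2$ matrix, whereas the second version only multiplies matrices of size at most $\max(n_1, n_2)$.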


Proofs 
Proof of Theorem 1 


Let $\mathcal{A}$ be a graph neural network which satisfies conditions a) and b). We first prove the first part of the conclusion,
namely that $\mathcal{A}$ can map any graphs $G$ and $H$ that the 1-WL test decides as non-isomorphic to different embeddings. Suppose
that, starting from iteration $l$, the 1-WL test decides that $G$ and $H$ are non-isomorphic (before that, the 1-WL test cannot
distinguish the two graphs), but the graph neural network maps them to the same embedding, $\mathcal{A}(G) = \mathcal{A}(H)$. This indicates
that $G$ and $H$ always have the same labels for iterations $i - 1$ and $i$ for any $i = 1, \dots, l - 1$ in the 1-WL test. Next we hope
to reach a contradiction to this statement. To find this contradiction, we first show that on graph $G$ or $H$, if the node
features in the graph neural network satisfy $\phi_i(v_1) = \phi_i(v_2)$, we always have 1-WL node labels $a_i(v_1) = a_i(v_2)$ for any
iteration $i$. This clearly holds for $i = 0$ because 1-WL and the graph neural network start with the same node features.
Suppose this holds for iteration $j$. If, for any $v_1$ and $v_2$, $\phi_{j+1}(v_1) = \phi_{j+1}(v_2)$, then according to the node update rule in
Theorem 1 we get

$$u\big(\phi_j(v_1), f(\Phi_j(G_{v_1}))\big) = u\big(\phi_j(v_2), f(\Phi_j(G_{v_2}))\big). \tag{12}$$

Because $u$ and $f$ are both injective, we then obtain

$$\big(\phi_j(v_1), \Phi_j(G_{v_1})\big) = \big(\phi_j(v_2), \Phi_j(G_{v_2})\big). \tag{13}$$

According to Lemma 1, if $\Phi_j(G_{v_1}) = \Phi_j(G_{v_2})$, then $\{\{\phi_j(w), w \in V(G_{v_1})\}\} = \{\{\phi_j(w), w \in V(G_{v_2})\}\}$, and because
$\phi_j(v_1) = \phi_j(v_2)$, we get

$$\big(\phi_j(v_1), \{\{\phi_j(w), w \in N(v_1)\}\}\big) = \big(\phi_j(v_2), \{\{\phi_j(w), w \in N(v_2)\}\}\big). \tag{14}$$

By our assumption at iteration $j$, we must have

$$\big(a_j(v_1), \{\{a_j(w), w \in N(v_1)\}\}\big) = \big(a_j(v_2), \{\{a_j(w), w \in N(v_2)\}\}\big). \tag{15}$$

Because the mapping in the 1-WL test is injective with respect to the node label and the multiset of neighborhood labels,
we get $a_{j+1}(v_1) = a_{j+1}(v_2)$. By induction, if the node features in the graph neural network satisfy $\phi_i(v_1) = \phi_i(v_2)$, we always
have 1-WL node labels $a_i(v_1) = a_i(v_2)$ for any iteration $i$. This creates a valid mapping $q$ such that $a_i(v) = q(\phi_i(v))$ for any
node $v$ in the graph.

Because 1-WL decides that graphs $G$ and $H$ are non-isomorphic at iteration $l$, we have
$\{\{a_l(v), v \in V(G)\}\} \neq \{\{a_l(v), v \in V(H)\}\}$. With the mapping between $a_l(v)$ and $\phi_l(v)$, we get
$\{\{\phi_l(v), v \in V(G)\}\} \neq \{\{\phi_l(v), v \in V(H)\}\}$. Because the graph-readout function of the graph neural network is injective
according to Theorem 1, we should get $\mathcal{A}(G) \neq \mathcal{A}(H)$, which contradicts our assumption.

For the second part of the conclusion, namely that there exist graphs $G$ and $H$ that are decided as isomorphic by the 1-WL
test but as non-isomorphic by the subgraph-based graph neural network $\mathcal{A}$, it suffices to find an example that satisfies
this. The example shown in Figure 1(a) cannot be distinguished by the 1-WL graph isomorphism test, but for the subgraph
associated with each node, the random walk graph kernel can map them to different embeddings by interacting with an
appropriate graph filter (e.g., a graph filter with one node).
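To illustrate the second part of the proof, the sketch below runs 1-WL colour refinement on a classic pair of non-isomorphic graphs, two disjoint triangles and a 6-cycle, which 1-WL cannot tell apart. This pair is chosen here for illustration and is not necessarily the one in Figure 1(a). The edge count of each node's 1-hop subgraph is used as a simple stand-in for a subgraph-level statistic; it is not the random walk kernel itself.

```python
from collections import Counter


def wl_color_histograms(graphs, num_iterations):
    """Run 1-WL refinement jointly on several graphs given as adjacency lists."""
    colors = [{v: 0 for v in g} for g in graphs]
    for _ in range(num_iterations):
        palette = {}                       # shared across graphs so colours are comparable
        new_colors = []
        for g, col in zip(graphs, colors):
            new_col = {}
            for v in g:
                signature = (col[v], tuple(sorted(col[w] for w in g[v])))
                new_col[v] = palette.setdefault(signature, len(palette))
            new_colors.append(new_col)
        colors = new_colors
    return [Counter(col.values()) for col in colors]


def ego_subgraph_edge_count(g, v):
    """Number of edges in the subgraph induced by v and its 1-hop neighbours."""
    nodes = set(g[v]) | {v}
    return sum(1 for a in nodes for b in g[a] if b in nodes and a < b)


two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
hexagon = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}

hist_a, hist_b = wl_color_histograms([two_triangles, hexagon], num_iterations=3)
edges_a = sorted(ego_subgraph_edge_count(two_triangles, v) for v in two_triangles)
edges_b = sorted(ego_subgraph_edge_count(hexagon, v) for v in hexagon)
```

Both graphs are 2-regular, so every node keeps the same colour and the two histograms coincide, while the 1-hop subgraphs differ (a triangle with 3 edges versus a path with 2 edges).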


Proof of Lemma 2 


To find at least one feasible $\Phi(H)$, assume the length of the non-zero vector $\Phi(G_v)$ is large but finite, $\Phi(G_v) = [c_0, c_1, \dots, c_n]$, with
maximum absolute value $c$. Then we can encode each value of $\Phi(G_v)$ with the base $2c$ according to its position in the vector.
Specifically, we can let $\Phi(H) = [(2c)^0, (2c)^1, \dots, (2c)^n]$, and the inner product $K(H, G_v) = \langle \Phi(H), \Phi(G_v) \rangle = \sum_{i=0}^{n} c_i (2c)^i$
will be injective with respect to $\Phi(G_v)$.
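The positional-encoding idea in this proof can be checked by brute force on a small case. The sketch below assumes, purely for the purpose of the check, that the entries of $\Phi(G_v)$ are non-negative integers bounded by $c$; this restriction is added here and is not part of the lemma's statement. The function names are hypothetical.

```python
from itertools import product


def filter_embedding(length, c):
    """Phi(H) = [(2c)^0, (2c)^1, ..., (2c)^(length-1)]."""
    base = 2 * c
    return [base ** i for i in range(length)]


def kernel_value(phi_h, phi_g):
    """Inner product <Phi(H), Phi(G_v)>."""
    return sum(a * b for a, b in zip(phi_h, phi_g))


c, length = 3, 4
phi_h = filter_embedding(length, c)
# Enumerate every vector with entries in {0, ..., c} (assumed range) and
# collect the resulting kernel values.
all_vectors = list(product(range(c + 1), repeat=length))
values = {kernel_value(phi_h, phi_g) for phi_g in all_vectors}
is_injective = len(values) == len(all_vectors)
```

Under this assumption each entry acts as a digit smaller than the base $2c$, so distinct vectors yield distinct kernel values.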


Connections between KerGNNs and CNNs 


In this section, we discuss the claim in the paper that KerGNNs generalize CNNs into the graph domain from the kernel’s 
point of view. We show in the first subsection that 2-D image convolution in CNNs is equivalent to calculating the appropriate 
kernel function between patches and filters. Then we show in the second subsection that KerGNNs generalize this aggregation 
approach by introducing the counterparts of patch, filter and convolution in the graph regime. 


Rethinking CNNs from the Kernel’s Perspective 


Standard CNN models use image convolution to aggregate the local information around each pixel. In this subsection, we show 
that the image convolution can be viewed as applying kernel functions between input patches and filters in the convolutional 
layers. 

Given a convolutional layer in a CNN, the input to the layer is an image $I \subset \Omega$, where $\Omega \subset [0, 1]^2$ is a set of pixel coordinates.
Typically, $\Omega$ is a two-dimensional grid. We also define a feature mapping function to map every pixel in the image to a finite
vector space, $\phi: I \to \mathbb{R}^d$, where $d$ is the dimension of the input feature space. For each pixel $q \in I$ with coordinate $(x, y)$,
we can find a neighborhood patch $\chi_q$ centered at the pixel $q$, with patch size $r \times r$. $\phi(q)$ is the feature map of the pixel in
$\mathbb{R}^d$, and with some abuse of notation, $\phi(\chi_q)$ is defined as the concatenation of the feature maps of every pixel in the patch
$\chi_q$, i.e., $\phi(\chi_q) = \Vert_j\, \phi(q_j)$ for every pixel $q_j \in \chi_q$. Thus, $\phi(\chi_q)$ lives in the vector space $\mathbb{R}^{d r^2}$. For example, given an RGB image
as the input, $\phi(q)$ is in the Euclidean space $\mathbb{R}^3$, and $\phi(\chi_q)$ is the feature vector in $\mathbb{R}^{3 r^2}$.

Next, we represent the output of the layer as a different image $I' \subset \Omega$ with the feature mapping function $\phi': I' \to \mathbb{R}^{d'}$, where
$d'$ is the dimension of the output feature space. As shown in Figure 3(a), $I$ and $I'$ may not have the same size, but every pixel
$q' \in I'$ corresponds to a pixel $q \in I$ with an associated patch $\chi_q$. The goal of the convolutional layer is then to learn this output
feature mapping function $\phi'$. Specifically, the convolutional layer adopts $d'$ filters to perform the 2-D convolution operation
over the image. The $i$-th filter performs the dot product over each patch of the image with a fixed stride, and can be
parameterized as a vector $z_i \in \mathbb{R}^{d r^2}$. The output feature is obtained by computing the dot product of $z_i$ and $\phi(\chi_q)$, followed by
an element-wise nonlinear function $\sigma$. In other words, it is the $i$-th dimension of the feature representation $\phi'(q')$ of the
pixel $q'$ in the new image $I'$. The process can be written as follows:

$$
\begin{aligned}
\phi'_i(q') &= \sigma(\Phi * Z_i) \\
&= \sigma\Big( \sum_{t=1}^{d} \sum_{m,n} \Phi[t, x - m, y - n]\, Z_i[t, m, n] \Big) \\
&= \sigma\big( \langle \phi(\chi_q), z_i \rangle \big) = \sigma\big( z_i^\top \phi(\chi_q) \big),
\end{aligned}
\tag{16}
$$

where $*$ represents the image convolution, and $\Phi$ and $Z_i$ are tensors with shape $(d \times r \times r)$ reshaped from the vectors $\phi(\chi_q)$ and
$z_i$, respectively. We omit the bias term here for simplicity.

Now, we rethink this process from the kernel perspective. According to the theory of kernel methods, each positive definite
kernel function $K$ implicitly defines an RKHS $\mathcal{H}$. Next, we try to make this implicit RKHS the output feature space of this
convolutional layer, by appropriately designing the associated kernel function. Here, we can define a simple dot-product RBF
kernel function between the input feature map vector $\phi(\chi_q)$ and the filter vector $z_i$ as

$$K(\phi(\chi_q), z_i) = \exp\big( z_i^\top \phi(\chi_q) \big). \tag{17}$$
Figure 3: Comparisons of filters in CNN and KerGNNs. (a) The yellow grids represent the input image, and the pixels in
the red box represent the patch around pixel q. The blue grids represent the filter. The position of pixel q in the output grid is
denoted by q'. (b) The graph denoted in the yellow shadow represents the subgraph of node v. The graph with blue shadow
represents the graph filter.


It is worth noting that the RKHS $\mathcal{H}$ determined by this RBF kernel is of infinite dimensions, and thus we need to make the
assumption here that $\mathcal{H}$ can be approximated by a finite output vector space, with a finite set of basis vectors which are
defined as trainable filters $\{z_i : i = 1, \dots, d'\}$ and $\mathcal{H} = \mathrm{Span}(z_1, \dots, z_{d'})$. This is similar to the assumption made by
(Mairal et al. 2014). It is noted that the output feature space can also be approximated by a subspace of $\mathcal{H}$ using the Nyström
method as in (2016).

Then we can derive the output feature map of the pixel $q'$ by projecting the input feature map vector $\phi(\chi_q)$ into $\mathcal{H}$, and thus
the $i$-th dimension of the output feature map $\phi'(q')$ can be computed as the projection of $\phi(\chi_q)$ onto the $i$-th basis vector $z_i$ as

$$
\begin{aligned}
\phi'_i(q') &= \langle \phi(\chi_q), z_i \rangle_{\mathcal{H}} \\
&= K(\phi(\chi_q), z_i) = \exp\big( z_i^\top \phi(\chi_q) \big),
\end{aligned}
\tag{18}
$$

where the second equality holds because of the kernel method. We thus obtain the formula to calculate the output feature
mapping function from a perspective that is different from image convolution, and the result is similar to Equation 16. The
only difference is in the element-wise nonlinear activation function, and it is worth noticing that the exponential
nonlinearity induced by the RBF kernel is very similar to the popular ReLU function $\sigma(\cdot)$ used in practice.

Now we come to the conclusion that, under suitable assumptions, the standard image convolution in CNN layers can be
approximately interpreted as applying kernel functions between the input patches and the trainable filters, and we can use the
computed kernel value to update the feature map of the pixel in the output space. This is our major motivation for the design
of KerGNN. In KerGNN, we use the computed kernel values to update the feature map of the node in the output graph. This
comparison of convolutional layers and KerGNN layers is shown in Figure 3.
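The equivalence between a convolutional layer and patch-filter dot products can be checked numerically. The sketch below is an illustration under stated assumptions: it uses stride 1, no padding, and cross-correlation (no filter flip, as in most deep learning libraries) rather than the flipped convolution, and all function names are hypothetical.

```python
import numpy as np


def conv_layer_loops(image, filters):
    """Explicit sum over channels and filter offsets for each output pixel."""
    d, H, W = image.shape
    d_out, _, r, _ = filters.shape
    out = np.zeros((d_out, H - r + 1, W - r + 1))
    for i in range(d_out):
        for y in range(H - r + 1):
            for x in range(W - r + 1):
                for t in range(d):
                    for m in range(r):
                        for n in range(r):
                            out[i, y, x] += image[t, y + m, x + n] * filters[i, t, m, n]
    return out


def conv_layer_dot(image, filters):
    """Same pre-activation, computed as z_i^T phi(chi_q) on flattened patches."""
    d, H, W = image.shape
    d_out, _, r, _ = filters.shape
    Z = filters.reshape(d_out, -1)                 # one flattened filter z_i per row
    out = np.zeros((d_out, H - r + 1, W - r + 1))
    for y in range(H - r + 1):
        for x in range(W - r + 1):
            patch = image[:, y:y + r, x:x + r].reshape(-1)   # phi(chi_q) in R^{d r^2}
            out[:, y, x] = Z @ patch
    return out


rng = np.random.default_rng(0)
image = rng.normal(size=(3, 5, 5))                 # d = 3 channels (e.g. RGB)
filters = rng.normal(size=(2, 3, 3, 3))            # d' = 2 filters with r = 3
pre_activation = conv_layer_dot(image, filters)
relu_features = np.maximum(pre_activation, 0.0)    # sigma = ReLU, as in Equation 16
kernel_features = np.exp(pre_activation)           # exponential kernel, as in Equation 18
```

The two output feature maps share the same pre-activation and differ only in the element-wise nonlinearity.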


Generalizing CNNs to KerGNNs 


We explain in this subsection how we derive the update rule shown in Algorithm 1, inspired by CNNs. We consider a single
KerGNN layer whose input is an undirected graph $G = (\mathcal{V}, \mathcal{E})$ with node feature map $\phi_0: \mathcal{V} \to \mathbb{R}^{d_0}$.

We rely on the graph filters (as the counterpart of filters) $\{H_1^{(i)} : i = 1, \dots, d_1\}$ to aggregate each node's subgraph (as
the counterpart of the patch) and obtain $\phi_1(v)$. Specifically, we consider using an R-convolution type kernel function, which
implicitly defines an RKHS $\mathcal{H}_1$. Similar to the analysis of CNNs from the kernel perspective (as discussed in the first
subsection), we assume that $\mathcal{H}_1$ can be approximated by the finite-dimensional space $\mathbb{R}^{d_1}$ with a set of $d_1$ vectors
$\{\Phi(H_1^{(i)}) : i = 1, \dots, d_1\}$, so that $\mathcal{H}_1 = \mathrm{Span}(\Phi(H_1^{(1)}), \dots, \Phi(H_1^{(d_1)}))$. Then, we calculate $\phi_1(v)$ by projecting the subgraph feature
map $\Phi_0(G_v)$ into $\mathcal{H}_1$, and the $i$-th dimension of $\phi_1(v)$ can be calculated as the inner product between $\Phi_0(G_v)$ and the basis
vector $\Phi(H_1^{(i)})$, i.e.,

$$\phi_{1,i}(v) = \big\langle \Phi_0(G_v), \Phi(H_1^{(i)}) \big\rangle_{\mathcal{H}_1} = K(G_v, H_1^{(i)}), \tag{19}$$

where we adopt the random walk kernel as $K$. Similar to our conclusions for CNNs, Equation 19 indicates that we can use the
kernel values computed between the subgraph of a node and the graph filters as the output feature map of the node. After
calculating the kernel value of the subgraph $G_v$ with respect to every graph filter in $\{H_1^{(i)} : i = 1, \dots, d_1\}$, we obtain every
dimension of the feature map of node $v$ in $\mathcal{H}_1$, which is the output of the KerGNN layer.

We compare CNNs and KerGNNs in Figure 3 and summarize the similarities in the following:

Sliding over Inputs. In a CNN, the filter is systematically applied to each filter-sized patch of the input image, from left to
right and top to bottom, with a specified stride. In KerGNNs, we sample a subgraph $G_v = (\mathcal{V}_v, \mathcal{E}_v)$ consisting of $v$ and its j-hop
neighbors, and the graph filter is applied to $G_v$ for all $v \in \mathcal{V}$ (the adjacency of $G$ is preserved in the subgraph). It should be
noted that the operations defined below do not require the subgraph and the graph filter to have the same number of nodes or
the same topology.

Shared Parameters. In a CNN, all the patches share the same filter for convolution to reduce the number of parameters. In
KerGNNs, all the subgraphs $G_v$ for $v \in \mathcal{V}$ also share the same graph filter, and thus the number of graph filter parameters does
not grow as the input graph becomes larger.

Local Aggregation. In a CNN, the patch consists of a central pixel and neighboring pixels, and the feature map of the patch
$\phi(\chi_q)$ contains all the neighbors' information. The filters aggregate the patch and assign the corresponding kernel value to the
output feature map of the central pixel. In KerGNNs, we use the graph filters to aggregate the feature map of the subgraph
$\Phi(G_v)$, and let the kernel value be the output feature map of the central node.


Connections between KerGNNs and MPNNs 


We show in this section that KerGNNs generalize the standard MPNNs. From the point of view of KerGNNs, MPNNs deploy
a simple graph filter with one node, and an appropriate kernel function can be chosen within the KerGNN framework, such that
KerGNNs iteratively update node representations using neighborhood aggregation as in MPNNs. For example, the
vertex update rule of the Graph Convolutional Network (GCN) (Kipf and Welling 2016) can be written as

$$\phi_l(u) = \sigma\Big( W \sum_{v \in N(u) \cup \{u\}} \frac{\phi_{l-1}(v)}{\sqrt{|N(u)|\,|N(v)|}} \Big), \tag{20}$$

where $N(u)$ represents all the neighboring nodes of $u$, $\sigma$ is the element-wise nonlinear function, $\phi_{l-1}(v) \in \mathbb{R}^{d_{l-1}}$ and
$\phi_l(v) \in \mathbb{R}^{d_l}$. To interpret this update rule in the KerGNN framework, we can define the subgraph $G_u$ containing the nodes
$N(u) \cup \{u\}$. We also define $d_l$ graph filters $\{H^{(i)} : i = 1, \dots, d_l\}$, and each graph filter is defined as $H^{(i)} = (\{h_i\}, \{\})$. The
attribute of the node $h_i$ is parameterized by $w^{(i)\top} \in \mathbb{R}^{1 \times d_{l-1}}$, the $i$-th row of $W$. Then we define an R-convolution graph kernel as

$$K(G_u, H^{(i)}) = \sum_{v \in N(u) \cup \{u\}} \sum_{v' \in \{h_i\}} k_{\mathrm{base}}(v, v') = \sum_{v \in N(u) \cup \{u\}} k_{\mathrm{base}}(v, h_i), \tag{21}$$

and the node-wise kernel $k_{\mathrm{base}}$ is defined as

$$k_{\mathrm{base}}(v, h_i) = \frac{w^{(i)\top} \phi_{l-1}(v)}{\sqrt{|N(u)|\,|N(v)|}}. \tag{22}$$

Using the KerGNN framework, the $i$-th dimension of the output feature map $\phi_{l,i}(u)$ can be written as the nonlinearity $\sigma$
applied to the kernel value between $G_u$ and $H^{(i)}$:

$$
\begin{aligned}
\phi_{l,i}(u) &= \sigma\big( K(G_u, H^{(i)}) \big) \\
&= \sigma\Big( \sum_{v \in N(u) \cup \{u\}} \frac{w^{(i)\top} \phi_{l-1}(v)}{\sqrt{|N(u)|\,|N(v)|}} \Big),
\end{aligned}
\tag{23}
$$

which is equivalent to the $i$-th component of Equation 20. Therefore, the message aggregation in most MPNNs can be treated as
using one-node graph filters in KerGNNs, and our proposed method generalizes MPNNs by deploying more complex graph filters
with multiple nodes and a learnable adjacency matrix.
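The correspondence between the GCN update and one-node graph filters can be verified on a toy graph. The sketch below makes several assumptions for illustration: $\sigma$ is taken to be ReLU and applied to the summed kernel value, the normalization uses $|N(u)|$ exactly as written in the text (so the graph must have no isolated nodes), and the function names and example graph are hypothetical.

```python
import numpy as np


def gcn_layer(A, Phi, W):
    """GCN update in matrix form, with 1/sqrt(|N(u)||N(v)|) normalisation."""
    deg = A.sum(axis=1)
    norm = 1.0 / np.sqrt(np.outer(deg, deg))
    A_hat = (A + np.eye(len(A))) * norm            # include u itself in the sum
    return np.maximum(A_hat @ Phi @ W.T, 0.0)      # sigma assumed to be ReLU


def one_node_filter_layer(A, Phi, W):
    """Same update, written as a sum of node-wise base kernels per filter."""
    n, d_out = len(A), W.shape[0]
    deg = A.sum(axis=1)
    out = np.zeros((n, d_out))
    for u in range(n):
        members = [v for v in range(n) if A[u, v] > 0 or v == u]   # N(u) and u
        for i in range(d_out):
            # W[i] plays the role of the attribute of the single filter node h_i.
            kernel = sum(W[i] @ Phi[v] / np.sqrt(deg[u] * deg[v]) for v in members)
            out[u, i] = max(kernel, 0.0)
    return out


# A 5-cycle with one chord, so every node has at least two neighbours.
A = np.zeros((5, 5))
for a, b in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (0, 2)]:
    A[a, b] = A[b, a] = 1.0
rng = np.random.default_rng(0)
Phi = rng.normal(size=(5, 4))                      # input features, d_{l-1} = 4
W = rng.normal(size=(3, 4))                        # d_l = 3 one-node filters
gcn_out = gcn_layer(A, Phi, W)
filter_out = one_node_filter_layer(A, Phi, W)
```

Each row of `W` acts as one single-node filter, so the number of filters equals the output dimension of the layer.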


Model Implementation Details 


To construct the subgraph for a node, we first find all the j-hop neighbors of the node, and extract the subgraph induced
by the node and all these neighbors. In the implementation, to be compatible with matrix multiplication, we set a maximum size
for subgraphs. Any subgraph exceeding this limit is truncated, keeping the nearer neighbors. The adjacency matrix of a subgraph
that does not reach this limit is padded with zeros. Therefore, the adjacency matrices of all the subgraphs have the same
size.
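The subgraph construction described above can be sketched with a breadth-first search. This is a minimal illustration, not the released code: the function name and the tie-breaking order among equally distant neighbors are assumptions.

```python
from collections import deque


def extract_subgraph(adj_list, center, num_hops, max_size):
    """Return (nodes, adjacency matrix) for the num_hops-hop subgraph of center.

    The matrix always has shape max_size x max_size: larger subgraphs are
    truncated and smaller ones are zero-padded.
    """
    # Breadth-first search visits nodes in order of increasing distance,
    # so truncating the visit order keeps the nearer neighbours.
    distance = {center: 0}
    order = [center]
    queue = deque([center])
    while queue:
        v = queue.popleft()
        if distance[v] == num_hops:
            continue
        for w in adj_list[v]:
            if w not in distance:
                distance[w] = distance[v] + 1
                order.append(w)
                queue.append(w)
    nodes = order[:max_size]
    index = {v: i for i, v in enumerate(nodes)}
    matrix = [[0] * max_size for _ in range(max_size)]
    for v in nodes:
        for w in adj_list[v]:
            if w in index:                 # keep only edges inside the subgraph
                matrix[index[v]][index[w]] = 1
    return nodes, matrix


path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
nodes_padded, adj_padded = extract_subgraph(path, center=2, num_hops=1, max_size=4)
nodes_cut, adj_cut = extract_subgraph(path, center=2, num_hops=2, max_size=3)
```

In the first call the subgraph has three nodes and is padded to size 4; in the second call the 2-hop subgraph has five nodes and is truncated to the three nearest ones.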


For the first layer of the KerGNN model, we optionally add an additional linear mapping to transform node attributes of 
the input graph to a specified dimension (the dimension of node attributes of graph filter in the first layer), such that we can 
calculate the graph kernel between input graphs and graph filters at the first layer. 

For simplicity, we define all the graph filters at the same layer to have the same number of nodes. Besides, for comparison,
we only use one type of graph kernel within one model, although different graph filters can interact with the same subgraphs
through different types of graph kernels, and thus different graph kernels can be mixed within one model.


Experiment Details and More Results 
Graph Classification Task 


Hyper-parameter Search. We conduct the experiments using an Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz with an
NVIDIA GPU (GeForce RTX 2070). We use grid search to select the hyper-parameters for each model during cross-validation,
and the hyper-parameter search range is shown in Table 2.


Table 2: Hyper-parameter search range.

hidden dimension of the first layer      [8; 16; 32; 64]
number of graph filters                  [16; 32; 64; 128]
number of nodes of graph filter          [2; 4; 6; 8; 10; 12; 14; 16; 18; 20]
maximum number of nodes for subgraph     [5; 10; 15; 20; 25; 30]
j-hop neighborhood                       [1; 2; 3]
maximum step for random walk             [1; 2; 3; 4; 5]
hidden dimension of linear layer         [8; 16; 32; 48; 64]
dropout rate                             [0.2; 0.4; 0.6; 0.8]
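The search range in Table 2 can be enumerated as a grid as sketched below. The dictionary keys are hypothetical names chosen for this illustration, and the sketch only enumerates configurations; it does not show how each one is trained or scored.

```python
from itertools import product

# Search ranges copied from Table 2; the key names are illustrative assumptions.
search_space = {
    "hidden_dim_first_layer": [8, 16, 32, 64],
    "num_graph_filters": [16, 32, 64, 128],
    "filter_num_nodes": list(range(2, 21, 2)),
    "max_subgraph_nodes": [5, 10, 15, 20, 25, 30],
    "num_hops": [1, 2, 3],
    "max_walk_steps": [1, 2, 3, 4, 5],
    "hidden_dim_linear": [8, 16, 32, 48, 64],
    "dropout": [0.2, 0.4, 0.6, 0.8],
}


def grid(space):
    """Yield one dict per combination of hyper-parameter values."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


first_config = next(grid(search_space))
num_configs = sum(1 for _ in grid(search_space))
```

The full Cartesian product is large, which is one reason model selection is done on the inner validation split rather than on the test folds.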


Details of Datasets. We use 5 bioinformatics datasets: MUTAG is a dataset of 188 mutagenic aromatic and heteroaromatic 
nitro compounds with 7 discrete labels. PROTEINS dataset uses secondary structure elements (SSEs) as nodes and two nodes 
are connected if they are neighbors in the amino-acid sequence or in 3D space. It has 3 discrete labels, representing helix, sheet, 
or turn. NCI1 is a dataset made publicly available by the National Cancer Institute (NCI) and is a subset of balanced datasets of 
chemical compounds screened for ability to suppress or inhibit the growth of a panel of human tumor cell lines. DD is a dataset 
of 1178 protein X-Ray structures. Two nodes in a protein are connected by an edge if they are less than 6 Angstroms apart. The 
prediction task is to classify the protein structures into enzymes and non-enzymes. 

We use 4 social datasets: IMDB-BINARY and IMDB-MULTI are movie collaboration datasets. Each graph corresponds to
an ego-network for each actor/actress, where nodes correspond to actors/actresses and an edge is drawn between two
actors/actresses if they appear in the same movie. Each graph is derived from a pre-specified genre of movies, and the task is
to classify the genre that the graph is derived from. IMDB-BINARY considers two genres: Action and Romance. IMDB-MULTI
considers three classes: Comedy, Romance, and Sci-Fi. Each graph of the REDDIT-BINARY dataset corresponds to an online
discussion thread, and nodes correspond to users. Two nodes are connected if at least one of them responded to the other's
comment. The task is to classify each graph into the community or subreddit it belongs to. COLLAB is a scientific collaboration
dataset, derived from 3 public collaboration datasets, namely, High Energy Physics, Condensed Matter Physics, and Astro
Physics. Each graph corresponds to an ego-network of different researchers from each field. The task is to classify each graph
into the field the corresponding researcher belongs to.


Model Parameters. We study how the test accuracy is influenced by the number of nodes in the graph filters, which
determines the size and complexity of the graph filters. As shown in Figure 4(a), the optimal size of the graph filter differs
across datasets, depending on the local structures of different types of graphs, e.g., the star patterns in graphs of REDDIT-B
and the ring and chain patterns in graphs of NCI1. We also study the influence of the maximum length of random walks in
Figure 4(b). Longer walks generally benefit the classification results, except for the ENZYMES dataset, where the model
performs best with walks of length 2. In the code implementation, we specify the maximum number of nodes that the subgraph
can contain, which implicitly controls the size of the subgraphs, and thus we also study the influence of this node-number
threshold in Figure 4(c). DD and ENZYMES achieve higher accuracy with larger subgraphs, because larger subgraphs contain
richer neighborhood topology information. The remaining datasets are not influenced much, because we fix the size of graph
filters, and the model performance degrades when the subgraph size and graph filter size are mismatched; for example, small
graph filters cannot handle larger subgraphs.


Model Running Time. We compare the running time of the proposed model with several GNN benchmarks, as shown in
Table 3. We measure the one-epoch running time averaged over 50 epochs, using the same GPU card. All the models are
set to have hidden dimension 32 and 1 MLP layer. For KerGNNs, the number of nodes in the graph filter is set to 6 and the
maximum subgraph size is set to 10. We observe that KerGNN runs much faster than the high-order GNN benchmarks, and
achieves running time similar to that of conventional GNN models with 1-WL constraints.

Figure 4: Test accuracy w.r.t. model parameters. The results are obtained from experiments on 1 fold of datasets, and for all
the three graphs only the model parameter on the x-axis is changed with remaining model parameters fixed.


Table 3: The measured one-epoch running time (s) of different GNN models.

               DD            NCI1          PROTEINS      ENZYMES       IMDB-B        IMDB-M        REDDIT-B      COLLAB
DGCNN          0.271±0.005   0.425±0.009   0.134±0.006   0.069±0.002   0.113±0.007   0.161±0.008   0.411±0.010   0.824±0.053
DIFFPOOL       2.887±0.029   4.232±0.035   1.125±0.006   0.131±0.007   0.210±0.031   0.295±0.013   9.580±0.233   2.296±0.088
ECC            110.14±4.31   1.179±0.012   0.550±0.012   0.233±0.005   0.254±0.007   0.30±0.050    OOR           OOR
GIN            0.344±0.003   0.443±0.007   0.133±0.006   0.072±0.004   0.118±0.032   0.160±0.008   0.740±0.053   0.809±0.044
GRAPHSAGE      0.141±0.002   0.331±0.014   0.088±0.003   0.047±0.002   0.081±0.004   0.122±0.009   0.198±0.011   0.589±0.026
1-2-3 GNN      OOR           1.228±0.075   1.056±0.058   OOR           1.301±0.071   1.75±0.075    0.608±0.025   OOR
POWERFUL GNN   OOR           19.764±0.270  4.24±0.23     1.997±0.150   2.701±0.084   3.417±0.092   OOR           17.578±0.621
KERGNN-1       0.732±0.041   0.445±0.033   0.183±0.034   0.370±0.030   0.092±0.039   0.086±0.026   0.865±0.051   1.336±0.031
KERGNN-2       1.486±0.078   0.688±0.038   0.373±0.043   0.085±0.019   0.148±0.044   0.225±0.036   1.677±0.051   2.634±0.047
KERGNN-3       2.748±0.078   0.876±0.044   0.553±0.026   0.126±0.024   0.196±0.019   0.343±0.030   2.551±0.065   3.944±0.046
KERGNN-DRW     1.782±0.051   0.753±0.034   0.405±0.045   0.090±0.025   0.155±0.036   0.240±0.036   2.078±0.060   2.648±0.039


Node Classification Task 


We further evaluate the proposed KerGNN model on the node classification task. We use 4 datasets: Cora, Citeseer, Pubmed
(2008), and Chameleon (Rozemberczki, Allen, and Sarkar 2021). For each dataset, we randomly split the nodes of each class into
60%, 20%, and 20% for training, validation and testing, and report the mean accuracy of all models on the test sets over 10
random splits. We compare our model with several popular GNNs including GCN, GAT (Veličković et al. 2017), GEOM-GCN
(Pei et al. 2020), APPNP (Klicpera, Bojchevski, and Günnemann 2018), JKNet (Xu et al. 2018b) and GCNII (Chen et al. 2020).
As shown in Table 4, KerGNNs achieve similar or better results compared with state-of-the-art baselines.


Table 4: Node classification accuracy.

            CORA     CITESEER   PUBMED   CHAMELEON
GCN         85.77    73.68      88.13    28.18
GAT         86.37    74.32      87.62    42.93
GEOM-GCN    85.27    77.99      90.05    60.90
APPNP       87.87    76.53      89.40    54.30
JKNET       87.46    76.83      89.18    62.08
GCNII       88.49    77.08      89.57    60.61

KERGNN      87.96    76.61      89.53    62.28


More Visualizations 
In this section, we show the graph filters trained for the MUTAG dataset and the REDDIT dataset in Figures 5 and 6. We can see
that the trained graph filters have different patterns for the two datasets, and each type of pattern reveals the characteristics
of the corresponding dataset. Specifically, graph filters for MUTAG tend to have ring and circular patterns, while graph filters
for REDDIT tend to have star patterns.

Figure 5: Graph filters of MUTAG


